We’ve already spent time with supervised learning, where the model has an outcome variable. Specifically, we dealt with regression and classification.
We used supervised learning for inference (i.e., to understand the underlying data generating process), but now we care about prediction. So instead of letting our theories about the data generating process drive our selection of variables and worrying about whether our coefficient estimates are accurate, we’ll need to run many models and compare them to find which one predicts best.
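To make "run many models and find which one predicts best" concrete, here is a minimal sketch rather than the workflow used later in this unit. It assumes the built-in `mtcars` data as a stand-in, two arbitrary candidate formulas, and test-set RMSE as the yardstick; all of those choices are illustrative assumptions.

```r
# Load packages (tidymodels attaches rsample, parsnip, purrr, and yardstick).
library(tidymodels)

# Set a simulation seed so the split is reproducible.
set.seed(42)

# Hold out a quarter of a built-in dataset to judge predictions.
car_split <- initial_split(mtcars, prop = 0.75)
car_train <- training(car_split)
car_test <- testing(car_split)

# Two hypothetical candidate models that differ only in their predictors.
candidates <- list(
  small = mpg ~ wt,
  large = mpg ~ wt + hp + disp
)

# Fit each candidate on the training data and score it on the testing data.
test_rmse <- map_dbl(candidates, function(model_formula) {
  fitted_model <- linear_reg() |>
    set_engine("lm") |>
    fit(model_formula, data = car_train)
  predictions <- predict(fitted_model, new_data = car_test)
  rmse_vec(truth = car_test$mpg, estimate = predictions$.pred)
})

# Lower test-set RMSE means better out-of-sample prediction.
test_rmse
```

Notice that nothing here inspects the coefficients. The candidates are judged only by how well they predict data they were not fit on.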
Let’s import and work with some new data.
# Load packages.
library(tidyverse)
library(tidymodels)

# Set a simulation seed.
set.seed(42)
In this unit, we’ll be examining survey data from iRobot (Roomba).
roomba_survey <- read_csv(here::here("Data", "roomba_survey.csv"))
roomba_survey
## # A tibble: 332 × 128
##    sys_RespNum sys_StartTime sys_EndTime sys_LastQuestion sys_CBC_CBC1_design
##          <dbl>         <dbl>       <dbl> <chr>            <chr>
##  1           2    1456893467  1456893958 Finished         [[1,1,1,2,1,2,2,1],[2…
##  2           3    1456893643  1456893998 Finished         [[1,4,2,1,2,1,1,1],[2…
##  3           4    1456893769  1456893998 Finished         [[1,3,2,2,1,2,1,4],[2…
##  4           9    1456895699  1456904874 Finished         [[1,2,2,1,1,1,1,3],[2…
##  5          23    1456924935  1456925731 Finished         [[1,1,1,2,1,2,1,4],[2…
##  6          24    1456930656  1456932188 Finished         [[1,3,2,2,1,1,1,2],[2…
##  7          28    1456943719  1456943970 Finished         [[1,4,2,1,2,1,1,1],[2…
##  8          30    1456945961  1456946585 Finished         [[1,4,1,1,1,1,2,2],[2…
##  9          31    1456946554  1456946910 Finished         [[1,1,2,2,2,1,2,2],[2…
## 10          33    1456946838  1456947219 Finished         [[1,1,1,1,2,1,2,3],[2…
## # ℹ 322 more rows
## # ℹ 123 more variables: sys_CBC_CBC1_design_info <chr>, S1 <dbl>, S1A <dbl>,
## #   S1B <dbl>, S1C <dbl>, S1C_9_other <chr>, S2 <dbl>, S3Age <dbl>,
## #   S4Income <dbl>, CleaningAttitudes_1 <dbl>, CleaningAttitudes_2 <dbl>,
## #   CleaningAttitudes_3 <dbl>, CleaningAttitudes_4 <dbl>,
## #   CleaningAttitudes_5 <dbl>, CleaningAttitudes_6 <dbl>,
## #   CleaningAttitudes_7 <dbl>, CleaningAttitudes_8 <dbl>, …
# Answers to S1? This is Q1 in the survey dictionary, i.e., the first screening question.
roomba_survey |>
  count(S1)
## # A tibble: 3 × 2
##      S1     n
##   <dbl> <int>
## 1     1    40
## 2     3    63
## 3     4   229
Going forward, we will perform feature engineering on our outcome variable first. In other words, we pre-process the outcome variable before splitting the data, then pre-process the training and testing data.
# Wrangle S1 into segment.
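roomba_survey <- roomba_survey |>
  rename(segment = S1) |>
  mutate(
    # case_when() is an easier way to write multiple if-else statements.
    # Any value not matched below becomes NA.
    segment = case_when(
      segment == 1 ~ "own",
      segment == 3 ~ "shopping",
      segment == 4 ~ "considering"
    ),
    # Store the outcome as a factor so it is treated as categorical.
    segment = factor(segment)
  )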
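The order described above (wrangle the outcome, then split, then pre-process) can be sketched end to end on a toy stand-in for the survey. This is only an illustration under stated assumptions: `toy_survey` reuses the `S1` counts shown earlier, but `attitude_score` and `age` are simulated, hypothetical predictors, and the 75/25 stratified split and the normalization step are choices made for the example, not necessarily the ones used later in this unit.

```r
# Load packages.
library(tidyverse)
library(tidymodels)

# Set a simulation seed.
set.seed(42)

# Toy stand-in for the survey: S1 follows the counts shown earlier, while
# attitude_score and age are simulated, hypothetical predictors.
toy_survey <- tibble(
  S1 = rep(c(1, 3, 4), times = c(40, 63, 229)),
  attitude_score = rnorm(332, mean = 3, sd = 1),
  age = rnorm(332, mean = 40, sd = 12)
) |>
  rename(segment = S1) |>
  mutate(
    segment = case_when(
      segment == 1 ~ "own",
      segment == 3 ~ "shopping",
      segment == 4 ~ "considering"
    ),
    segment = factor(segment)
  )

# Split only after the outcome is wrangled, stratifying on the outcome.
toy_split <- initial_split(toy_survey, prop = 0.75, strata = segment)
toy_train <- training(toy_split)
toy_test <- testing(toy_split)

# Learn the pre-processing from the training data only.
toy_recipe <- recipe(segment ~ ., data = toy_train) |>
  step_normalize(all_numeric_predictors()) |>
  prep()

# Apply the same learned pre-processing to both sets.
train_baked <- bake(toy_recipe, new_data = NULL)
test_baked <- bake(toy_recipe, new_data = toy_test)

# Check that every segment is represented in the training data.
count(train_baked, segment)
```

Because the means and standard deviations used by `step_normalize()` are estimated from the training data alone, the testing data stays untouched until it is baked, which keeps the prediction check honest.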